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Abstract 

Trace-norm regularization plays a vital role in 
many learning tasks, such as low-rank ma¬ 
trix recovery (MR), and low-rank representa¬ 
tion (LRR). Solving this problem directly can be 
computationally expensive due to the unknown 
rank of variables or large-rank singular value 
decompositions (SVDs). To address this, we 
propose a proximal Riemannian gradient (PRG) 
scheme which can efficiently solve trace-norm 
regularized problems defined on real-algebraic 
variety A4 < r of real matrices of rank at most 
r. Based on PRG, we further present a sim¬ 
ple and novel subspace pursuit (SP) paradigm for 
general trace-norm regularized problems without 
the explicit rank constraint A4< r . The proposed 
paradigm is very scalable by avoiding large-rank 
SVDs. Empirical studies on several tasks, such 
as matrix completion and LRR based subspace 
clustering, demonstrate the superiority of the 
proposed paradigms over existing methods. 

1 Introduction 

Trace-norm regularization plays a vital role in various ar¬ 
eas, such as machine learning mm, data mining G3, 
computer vision and image processing ll26l |34] l46l . Most 
trace-norm based problems can be formulated into the fol¬ 
lowing general formulation l25ll : 

min ||X||* + AT(E), s.t. A(X) + B( E) = D, (1) 

X.E 

where A is a regularization parameter, ||X||* is the trace- 
norm (also known as the nuclear-norm) of a matrix X £ 
R mxn , both A and B are linear operators depending on 
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specific applications |25|, D denotes data or observations, 
E can be considered as an error term and T (E) is a regu- 
larizer on E which is possibly non-smooth. The trace norm 
||X||^ is the tightest convex lower bound to the rank func¬ 
tion rank(X) f35l , and the minimization of (|T]i encourages 
the variable X to be low-rank lfT4l fT5l f9ll. Among various 
trace-norm based problems, the low-rank matrix recovery 
(MR) (8), and low-rank representation (LRR) (1261 . have 
gained particular interest in the last decade. 

MR (8) seeks to recover a low-rank matrix X from partial 
observations that are recorded in a vector d £ R ; , where 
l <C rnn. If there are no outliers in the observations, one can 
recover X with high probability by solving the following 
problem (8| |2T| : 

min||X||*, s.t. 2l(X) = d, (2) 

which can be deemed as a simplified version of formulation 
£Q>. MR has been successfully applied in many tasks such 
as matrix completion filtering f37ll36il . However, the recov¬ 
ery performance by solving problem (0 might be seriously 
degraded if the observations contain severe outliers CH 
003. To improve the robustness, we may introduce an ad¬ 
ditional variable E into the constraint A(X) = d as in f|T}. 
and regularize it using fi-norm regularization ( i.e ., ||E||i) 
or f 2 ,i-norm regularization (i.e., 11E| | 2 ,i) nulla- 

LRR |26l seeks to find a low-rank representation X £ 
M nxn of given data D £ R mxra by solving an optimiza¬ 
tion problem of the following form: minx.E A||X||* + 
||E|| 2.1 s.t. DX + E = D, where D denotes the given 
data with n samples, and ||E|| 2,1 encourages the represen¬ 
tation error E to be column-wise sparse. LRR has been 
widely applied in many real-world tasks such as motion 
segmentation and face clustering |26l[25l|45| . 

Many algorithms have been proposed to solve trace-norm 
regularized problems Q3] [23] [T8] [24] [43] [50] [7] [4U l39l . 
but most focus on solving problem ©, such as the 
singular value thresholding (SVT) 0, augmented La- 
grangian method (ALM) and alternating direction method 
(ADM) ll24l gU [39]. Unlike these methods, some re- 





searchers proposed to solve an equivalent problem to prob¬ 
lem (0 (USED: min x ||X||* + 4||A(X) - d|||, where 7 
is a regularization parameter. This problem is known as the 
matrix lasso, and can be addressed by proximal gradient 
(PG) or accelerated proximal gradient (APG) lfl2l |43l |T9| . 

The optimization of problem 0 is more challenging due 
to the additional variable E. By minimizing X and E alter¬ 
natively, the aforementioned methods (e.g., ADM and PG) 
have been extended to solve this problem |;24l [25l. 

The above methods have shown great success in prac¬ 
tice mm- However, the optimization usually involves 
repetitive S VDs due to the SVT operation, making them in¬ 
efficient on large-scale problems ||29ll44 l. Using homotopy 
strategies and applying rank prediction techniques may ac¬ 
celerate the convergence speed with truncated SVDs f43l 
1213125]. However, the rank prediction could be non-trivial 
in general, and large-rank SVDs is still inevitable if the op¬ 
timal solution has a large rank. 

To develop more scalable algorithms, some researchers 
have tackled a version of the problem in which it is as¬ 
sumed that the rank of X is known, e.g.,rank(X) = r, 
and thus proposed to solve a variational form of problem 
© B2HHD: minc.H ||G|| J. + ||H|| J., s.t. T(GH t ) = D, 
where G G l raxr and H G K nxr . Many methods have 
been developed to address this problem, such as gradi¬ 
ent based methods l35l fl8l and stochastic gradient meth¬ 
ods 07] [230. However, these methods may still suffer 
from slow convergence speeds mm. 

Recently, fixed-rank methods by exploiting the smooth ge¬ 
ometry of matrices on fixed-rank manifolds have shown 
great advantages in computation for solving matrix re¬ 
covery problems m 0 m, such as the low-rank ge¬ 
ometric conjugate gradient method (LRGeomCG) m, 
the quotient geometric matrix completion method (qGe- 
omMC) l30l . and the method of scaled gradients on Grass- 
mann manifolds for matrix completion (ScGrassMC) 1331 . 
However, these methods can only deal with smooth objec¬ 
tives. Moreover, the rank parameter r is usually unknown 
in practice, and nontrivial to discover. 

Motivated by the superiority of Riemannian gradient-based 
methods on low-rank matrix recovery problems 04), in this 
paper, we exploit classical proximal gradient methods and 
geometries of the real-algebraic variety M< r to address 
the problem in 0. The main contributions of this paper 
are as follows: 

• We propose a proximal Riemannian gradient (PRG) 
scheme to address trace-norm regularized problems 
with explicit rank constraint rank(X) < r. By ex¬ 
ploiting geometries on M < r , PRG avoids repetitive 
large-scale SVDs of classical proximal methods, mak¬ 
ing it more scalable. 

• To address general trace-norm regularized problems 
in ( 1 ), we present a simple and novel active subspace 


framework which incorporates PRG as a slave solver. 
This framework does not require the prior knowledge 
of r and large-rank SVDs, and can even accelerate the 
convergence speed of PRG. 

2 Notations and Preliminaries 

Let the superscript T denote the transpose of a vec¬ 
tor/matrix, 0 be a vector/matrix with all zeros, diag(v) 
be a diagonal matrix with diagonal elements equal to v, 
(A.B) = tr(AB T ) be the inner product of A and B, 
and ||v|| p be the f p -norm of a vector v. Let A be a lin¬ 
ear operator with A* being its adjoint operator. The op¬ 
erator max(er,v) operates on each dimension of cr. Let 
X = Udiag(er)V T be the SVD of X G R™ xn . The nu¬ 
clear norm of X is defined as ||X||* = ||cr||! = \a t \ 
and the Frobenius norm of X is defined as ||X||^ = ||<t|| 2 . 
Lastly, for any convex function fl(X), let <9fi(X) denote 
its subdifferential at X. 

We now introduce some of the basic notions of the geome¬ 
try of fixed-rank matrices and matrix varieties as follows. 

Geometries of Fixed-rank Matrices. The fixed rank -r 
matrices lie on a smooth submanifold defined below 

M r = {X G R mxn : rank(X) = r} 

= {Udiag(cr )V T : U G St™ V G St”, ||<x || 0 = r}, 

where St™ = {U G R mxr : U T U = 1} denotes the 
Stiefel manifold of m x r real and orthonormal matrices, 
and the entries in cr are in descending order (44l . Moreover, 
the tangent space Ty_M r at X is given by 

TxAU = (UMV T +U p V T +UVj : M G R rXr ,U p G R mxr , 
UjU = 0,V P G R nxr .v p V = 0}. (3) 

Given X G M r and A, B G Ty_M r , by defining a metric 
9x(A, B) = (A,B), M r is a Riemannian manifold by 

restricting (A,B) to the tangent bundle The norm of 
a tangent vector ( Y 6 TyM r evaluated at X is defined as 

iicxii = v / (c^y. 

Once the metric is fixed, the notion of the gradient of an 
objective function can be introduced. For a Riemannian 
manifold, the Riemannian gradient of a smooth function 
/ : M r -> K at X G M r is defined as the unique tangent 
vector grad/(X) in TyM r , such that (grad/(X),£) = 
D/(X)[£], G Ty^M r . As M r is embedded in R mxn , 
the Riemannian gradient of / is given as the orthogonal 
projection of the gradient of f onto the tangent space. 
Here, the orthogonal projection of any Z G M mxn onto 
the tangent space Ty_M r at X = Udiag(er)V T is defined 
as 

PtxAu(Z) : Z PuZP v + P^ZP V + P V ZP^. (4) 

*The tangent bundle is defined as the disjoint union of all tan¬ 
gent spaces TMr = Uxe.M r { X } x 7xAf r . 




where Pjj = UU T and Pjj = I — UU T . Moreover, define 
P To M< r ( Z) = 0 when X = 0 02j. Letting G = V/(X) 
be the gradient of /(X) on vector space, it follows that 

grad/(X) = P T ^ Mr (G). (5) 

The Retraction mapping on M r relates an an element in 
the tangent space to a corresponding point on the manifold. 
One of the issues associated with such retraction mappings 
is to find the best rank-r approximation to X + £ in terms 
of the Frobenius norm 

Rx{£) =PmJX + H) 

= argmin||Y-(X + |)|| F . ( 6 ) 

YeMr 

In general, this problem can be addressed by performing 
SVD on X + £, which may be computationally expensive. 

Remark 1. Since £ = UMV T +U p V T +UVj £ T x M r , 
i?x (^) can be efficiently computed as in Algorithm 6 
in f44\! with efficient QR decompositions on low rank ma¬ 
trices Up and Vp. The corresponding time complexity is 
I4(m+n)r 2 +CsvDi" 3 , where r -C min(m, n) and Csvd 
is a moderate constant MU?. 

Varieties of Low-rank Matrices. Note that the submani¬ 
fold Ai r is open, and the manifold properties break down 
at the boundary where rank(X) < r, and the convergence 
analysis on Ai r will be difficult accordingly l38l . There¬ 
fore, it would be more convenient to consider the closure 
of Mr- 

M<r = {X £ R mx " : l-ank(X) < r}, (7) 

which is a real-algebraic variety f38l . Let ran(X) be 
the column space of X. In the singular points where 
rank(X) = s < r, we will construct search directions in 
the tangent cone lf38l (instead of the tangent space) 

T x M< r = T X M S © {H r - S €U ± €) (8) 

where U = ran(X) and V = ran(X T ). Essentially, H r _ s 
is a best rank-(r —s) approximation of G — PtxAT s (G), 
which can be cheaply computed with truncated SVD of 
rank (r — s). Let grad/(X) £ T x M< r be the projection 
of G on T-x_M<r- It can be computed by 

grad/(X) = P TxMs (G) + E r _ s . (9) 

Given a search direction / £ TxM< r , we need perform 
retraction which finds the best approximation by a matrix 
of rank at most r as measured in terms of the Frobenius 
norm, i.e., 

^x (0 = arg min ||Y - (X+ £)||f. ( 10 ) 

Y E-M< r 

Since E,._ s £ U 3 - © V x , i?x(£) w.r.t. M< r can be effi¬ 
ciently computed with the same complexity as on M r (see 
details in the supplementary file). 


3 Proximal Riemannian Gradient on M< r 

Directly solving the general trace-norm regularized prob¬ 
lem in ([I]) can be computationally expensive due to the un¬ 
known rank of variables (regarding fixed-rank methods) or 
large-rank singular value decompositions (SVDs) (regard¬ 
ing proximal gradient based methods). To make the prob¬ 
lem simpler, let us first consider the following problem with 
explicit rank constraint rank(X) < r 

minx.E ||X||* + AT(E), (11) 

s.t. A(X) +B(E) = D, X £ M< r - 

Here, the parameter r is supposed to be known. Never¬ 
theless, based on dTTl . we will propose a subspace pursuit 
paradigm to solve the general trace-norm regularized prob¬ 
lem in 0}; see details in Section 4. 

The penalty method is adopted to deal with the equality 
constraint -4(X) + £>(E) = D in (ITTI i. and it minimizes a 
penalized function over M< r in the following forn{]: 

®(X, E) = ||X||* + AT(E) + 7jj|-4(X) + B( E) - D|||, (12) 

where 7 is a penalty parameter. Note that when there are 
no outliers, we can let 6 (E) = 0 and Y(E) = 0, and the 
objective function 'L (X, E) is reduced to 

$(X) = ||X||. +J|^(X)-D|||,, X £ M< r - (13) 

T(X) is also the objective function of the matrix lasso 
problem I43HT91 , so one can adapt classical proximal meth¬ 
ods 02121 to address it. However, proximal gradient 
methods which directly operate on vector spaces could be 
very expensive if large-rank SVDs are required. 

In this section, we extend classical proximal methods on 
vector space 02021181 , and propose a proximal Rieman¬ 
nian gradient scheme to minimize (fl2l > and ([L3l > by exploit¬ 
ing geometries over the matrix variety M < r |j 

3.1 PRG on M < r for Non-outlier Cases 

The objective function T (X) regarding non-outlier cases is 
much simpler than T(X. E). When there are no outliers, 
we solve the following optimization problem: 

min ||X||* +/(X), s.t. rank(X) < r, (14) 

where /(X) is any smoothing function, for example 
/(X) = 4p ( X)-D||^. 

To introduce proximal methods on M < r , similarly as 
in 02 El, we introduce a local model of ’L(X) on M< r 

2 If D is a vector, the E-norm will be replaced by the fZ-norm. 

’Recall that At< r is a closure of the Riemannian submanifold 
Air- Here, we abuse “Riemannian” for simplicity. 



around Y gM< r but keeping ||X||* intact: 

m L (Y; X) := ||X||, + /(Y) + (grad/(Y),0 + | 

where £ G TyM.< t and X = Y + £. Note the above local 
model is different from that on vector spaces (see EHE]) 
in the sense that X — Y = £ is restricted on 7VA4 < r . 

Similar to classical proximal gradient methods I32ll43l , our 
proximal Rimannian gradient method solves problem ( 1 1 4b 
by minimizing mj,(Y;X) on ,A/!< r iteratively. In other 
words, given Y = X& in the fcth iteration, we need to solve 
the following optimization problem to obtain X^+i: 

min toz,(Y;X), s.t. X G M< r - (15) 

For convenience, let 

T L (Y):=arg min m L ( Y; X) (16) 

be a minimizer of (fl5l >. Then it can be computed as follows. 

Lemma 1. Let £ = — gradf(Y)/L. Denoting the SVD 
of Ry(£,) as Ry(£) = U+diag((T + )Vl, it follows that 
T l (Y) = U + diag(max(<r + - l/L,0))Vj. 

Please find the proof in supplementary file. 

Remark 2. 77, (Y) can be efficiently computed in the sense 
that 7?y( 0 = U+<ftag(<r+)Vj can be cheaply computed 
without expensive SVDs. 


Algorithm 1 Proximal Riemannian Gradient for Solving 
Problem (fTTb. 

Require: Xo, penalty parameter 7 , parameter r, stopping 
tolerance e. 

1 : For k = 1,..., K 

2 : Compute grad/(X^_i) according to @ or <|9]). 

3: Choose L k to satisfy (fl7i>, and set X t . = T Lk (X k _ 1 ). 

4: Terminate if stopping conditions are achieved. 

5: End 

6 : Return X* : . 


PRG iteratively minimizes a local model of T on ,A/l< r , as 
shown in Algorithm Q] PRG consists of two major steps 
1) compute a search direction in Step 3, and 2) update 
Xfc according to X& = 2"i, fe (Xfc_ 1 ). Here, 1 / L k can be 
deemed as the step size, and it can be determined using 
Armijo line search. Specifically, given a descent direction 
Cfc £ 2x fc M< r , L k is determined such that 

^( 7 i fc (Xfc))<^(X fe ) + /3(grad/(X fc ), G k )/L k , (17) 
where /3 G (0,1). 


Optimality condition of (fl 4 l) . A point X* G A4< r is a lo¬ 
cal minimizer of (IT4l) if and only if there exists g G d\ |X| |* 
such that ED 

grad/(X) + « = 0. (18) 

The following lemma guarantees the existence of L k . 

Lemma 2. Let X/- G M< r , and C k e lx A' l< r be a de¬ 
scent direction. Then there exists an L k that satisfies the 
condition in ( 1771) . 

Proof Since C, k is a descent direction, it follows that 
0 ^ grad/(X fc ) + 5||X||* and (grad/(X fc ), C k ) < 0. 
Since 7), (X/,.) is continuous in L, there must exist an L 
such that tt(T L (X fc )) < ^(X fe ) + /3(grad/(X fc ), £ k )/L, 
ML G [L, +00). □ 

In general, optimization methods on Riemannian manifolds 
are guaranteed to be locally convergent, and it is nontrivial 
to check whether a limit point X* is a global solution or 
not. However, for PRG, the limit point X* will be a global 
solution if r > rank(X*). 

Theorem 1. Let {X/.} be an infinite sequence of iterates 
generated by Algorithm{J} Then every accumulation point 
of {Xfc} is a critical point of f over A4< r . Furthermore, 
limfc-^oo || gradf(X. k ) + Cl If = 0 . Let X* denote the 
limit point. In particular, if rankfX.*) < r, then we have 
V/(X*) + C, = 0, i.e., X* is a global optimum to ( 1771 ). 

Proof. Note that T (X) is bounded below. The proof can 
be completed by adapting the proof of Theorem 3.9 in E 8 l . 

□ 

Stopping conditions of PRG. For simplicity, we stop PRG 
if the following condition is achieved: 

where e denotes a tolerance value. 

3.2 Robust PRG on M< r with Outliers 

Now, we extend PRG to minimize T(X. E) in (fT2l> regard¬ 
ing the outlier cases. For convenience, define 

/(X,E)^|j]A(X) + B(E)-D|||. 

We then need to solve the following problem: 

min ||X||* + AT(E) + /(X,E). (20) 

XeAt<,.,E 

Following l25l , we optimize the two variables X and E us¬ 
ing an alternating approach. Let the pair (X&, E/ l: ) denote 








Table 1: Computation of .S'a(B). 


T (E) 

MR: ||E||i 

LRR: | |E| | 2 ,l 

Sx( B) 

sgn(B) © max(|B| - ^-,0) 

max(||bj || — — ,0) 

[Sx(B)]i =— a b„vi 


the variables obtained from the fc-iteration. At the (k + l)th 
iteration, we update X and E as below: 

To update X, we fix E = E/ i: and minimize a local model 
ofT'(X,E) w.r.t. X: 

m L (X;X fc ,E fc ) := ||X||* +/(X fc , E fe ) 

+ (grad/(X fe ,E fc ),0 + i/2(|,0, 

where X = X*. +£,$,£ T^ k M< r , and L is a pos¬ 
itive number. Let T k (X kl E^) denote the minimizer of 
ttil (X; X&, E/-). Then T k (X k: E/.) can be computed ac¬ 
cording to LemmaQ] where L is determined by Armijo line 
search to make a sufficient decrease of the objective. 

To update E, we fix X = X^+i and solve a problem: 
min AT(E) + J||^(X fc+1 )+ B(E) - D|||,. (21) 

E Z 

Solving this problem with general B would be very diffi¬ 
cult. However, for MR and LRR, B( E) = E and Y(E) is 
either ||E||i or ||E||2,1 ■ As a result, the problem ([Til) has a 
closed-form solution. Let us define B& = D ,4(Xfc+i). 
Then B/, is a vector for MR and a matrix in the form of 
Bfc = [bf,..., bfj for LRR. The closed-form solution, 
denoted by S\( B^), is shown in Table Q] In cases where 
the problem (EH cannot be solved in closed-form, one may 
adopt iterative procedures to solve it. 

The detailed algorithm, which is referred to as robust PRG 
(RPRG), is shown in Algorithm |2] Due to the possible 
ill-conditioned issuesQ we apply a homotopy continuation 
technique to accelerate the convergence speed. Starting 
from an initial guess Ao, we set X k = min(Aop fc_1 , A) and 
compute Efc = S\ k (X&, Efc_i), where p is chosen from 
(0,1). Clearly, X k is non-increasing w.r.t. k. 


Algorithm 2 Robust PRG for Solving Problem (fill) . 
Require: Initial (Xo,Eo), parameter A and 7, initial Ao, 
parameter r, p £ (0,1), stopping tolerance e. 

1: For k = 1, K 

2 : Let X k = max(Aop fc_1 , A). 

3: Compute grad/(X fe _i, E fc _i) by (0 or 0. 

4: Choose L k by Armijo line search. Set X^ = 

2i fc (Xfc_i,Efc_i). 

5: Compute E k =S\ k (B fc _i), where B fe _i =D-A(X k ). 
6: Terminate if stopping conditions are achieved. 

7: End 

8: Return (X fc ,E fe ). 

4 When A is very small or 7 is very large, ||E||i can be very 
large at the beginning due to the thresholding in Table 1, making 
Xi far from its optimum. 


We discuss the convergence as follows. 

Proposition 1. Let 'L(X fc ,E fc ) = ||X||* + A fc T(E) + 
/(X, E), and {(Xfc,Efc)} be an infinite sequence of 
iterates generated by Algorithm \2} It follows that 
tf(X fc+ i,E fc+ i) < ^(Xfc.Efc), and {(X fc ,E fc )} con¬ 
verges to a limit point (X*, E*). 

Please find the proof in supplementary file. Thee stopping 
condition in (fl9l ) can be extended to RPRG by replacing 
'k(X) with ^(X, E). 

4 Subspace Pursuit for Solving Problem CD) 

Both PRG and RPRG methods operate on M< r , which 
rely on the knowledge of r. Unfortunately, the parameter r 
is usually unknown. Based on TheoremQ] one should set 
a sufficiently large r such that r > rank(X*), where X* is 
an optimal solution of problem 0. However, the compu¬ 
tational cost will dramatically increase if r is too large. 

Regarding the above issues, we propose a subspace pursuit 
(SP) paradigm to address problem 0. To introduce SP, we 
bring in an additional integer n which is assumed to be 
several times smaller than rank(X*). Taking the outlier 
cases for example, instead of doing RPRG with a large r, 
we gradually increase the rank of X from rank(X) = 0 
( e.g ., X = 0) by a small value k, and perform RPRG to 
the following subproblem with increased t. 

min \I/(X, E), s.t. rank(X) < tn, (22) 
X,E 

where t, denotes the iteration index. Essentially, the sub¬ 
space pursuit addresses problem 0 by solving a series of 
subproblems in (ET > using RPRG on A4 <tK (see Step 8 of 
Algorithm 0, where t = 1 and T denotes the maxi¬ 
mum number of iterations. 

Algorithm 3 Subspace Pursuit for Solving 0. 

Require: Parameters k, A and 7, initial Ao, x, p (where 
X < p), and stopping tolerance e and e. 

1: Initialize X° = 0, E° = 0, and Aq = Ao- 

2: For t = 1,..., T 

3: Let r = tn and s = (t — l)n. 

4: Let Aq = A* -1 , and A 4 = max(AoX*, A). 

5: Compute G = 7 ^*(yf(X 4 - 1 ) + S(E t_1 ) - D). 

6: Computegrad/(X t_1 , E t_1 )=Pt x t _ 1 ;vi s (G) + S^ _1 . 
7: Choose L t to satisfy QT). Set X£ = TzJX 4 - 1 ^ 4 " 1 ) 
and Eq = S x t (B 4 " 1 ), where B 4 - 1 =D - A(X 4 ). 

8: Update (X 4 , E 4 ) by calling RPRG to address (E2l) with 
initial input (Xq, Eq), r, Aq, A 4 , and e. 

9: Terminate if stopping conditions are achieved. 

10: End 


At the fth iteration, SP either stops or increase r from tn 
to (t + 1) k and the domain of X changes from a Rieman- 
nian manifold Mt. K to AT<(t+i) K . As a result, we need to 


















compute 3 r _ s in Step 6 to calculate grad/(X* 1 ,E t 1 ) 
according to equation (9). 

To accelerate the convergence speed, two techniques are 
critical, i.e. the warm start and homotopy continuation 
techniques. To warm start, we prepare a good initial guess 
of (Xg, Eq) for RPRG in Steps 4-6 (Steps 4-6 are exactly 
Steps 4-5 in Algorithm [2J. To apply the continuation tech¬ 
nique in RPRG. To facilitate its usage in SP, in Step 8 , we 
adaptively set the initial Ao and target A for RPRG by Ag 
and A* (see Step 4), respectively. Note the parameter \ 
should be smaller than p in RPRG. 

Note that PRG can be also incorporated into the SP frame¬ 
work. For convenience, we refer SP with RPRG and PRG 
to as SP-RPRG and SP-PRG, respectively. 

SP increase the rank by re iteratively. Due to limited size 
of X, SP will be stopped in limited steps. Moreover, the 
objective value T(X. E) monotonically decrease w.r.t. t. 

Proposition 2. Let {X f , E 3 4 } be the sequence generated by 
Algorithm\3\ Then we have 

T(X 4 ,E 4 ) < $(X*- 1 ,E t - 1 )-j 8 ||S‘- 1 |||,/L t . (23) 


Please find the proof in supplementary file. 

Stopping conditions of SP. Let X£ be the limit point of 
RPRG at the ith iteration. According to Theorem 0 Q if 
rank(Xj) < ire, Xjf must be a global solution. As a result, 
SP will stop at the tth iteration. According to Proposition 
[21 if \\a'~ |||n is very small, then there is no need to pro¬ 
ceed. Then we may also stop the iterations if the following 
condition is achieved: 


^(x 4 - 1 ^ 4 - 1 ) - T>(X 4 ,E 4 ) 
^(X 4 - 1 ^ 4 - 1 ) 


(24) 


where e denotes a tolerance value. 


4.1 Parameter settings 

For convenience of parameter setting, we suggest choosing 
the penalty parameter 7 in (l 22 l according to 7 — 1 /(z/oi), 
where v is a scaling factor, and er denotes the singular vec¬ 
tor of A* (D)0 For robust cases, the parameter A in (fl2l) is 
chosen by A = Syd m , where d m denotes the mean of |D|. 
The integer re is chosen such that <Ji > peri, Vi < re, and 
< 7 re +i < 70 - 1 . Without loss of generality, we suggest set¬ 
ting v e (0.0001,0.01), S <E (0.01,1], and rj £ (0.5, 0.9). 
One may also apply cross-validations to choose v and <5 re¬ 
garding model parameters. Lastly, in SP, we do not need to 
optimize each subproblem accurately. Therefore, in SP we 
suggest setting e = 0.01 for PRG and RPRG. 


3 The setting of 7 is consistent with the setting of p in matrix 

lasso in (43}, where 7 = 1 , and p = va 1 is suggested in general. 


4.2 Complexity Analysis 

The complexity of SP mainly includes two parts, i.e., the 
computation of H(. which can be done by truncated S VD of 
rank re and the subproblem optimization by PRG or RPRG. 

Here, we focus on the complexity of SP on MR. At the 
tth iteration of SP, the complexity of PRG or RPRG is 
0((m + n)(fre) 2 + Ztre), where ire < r + re. For suf¬ 
ficiently sparse matrices like in MR, the truncated SVD 
of rank re in SP can be completed in 0((m + n)re) using 
PROPACK f22ll : while the truncated SVD in existing prox¬ 
imal gradient based methods takes 0((m + n)r), where 
re is several times smaller than r. Note that in general SP 
needs only [Y/re] times of truncated SVDs. Therefore, SP 
is cheaper than existing proximal gradient based methods 
on MR. The complexity comparison on LRR can be found 
in supplementary file. 

5 Related Work 

The authors in [31] exploited Riemannian structures and 
presented a trust-region algorithm to address trace-norm 
minimizations. The proposed method, denoted by MMBS, 
alternates between fixed-rank optimization and rank-one 
updates. However, this method shows slower speed than 
APG on large-scale problems Oil . The authors in (28} pro¬ 
posed a Grassmannian manifold method to address trace- 
norm minimizations on a fixed-rank manifold. In general, 
this method has similar complexity to ScGrassMC that also 
operates on Grassmannian manifold If33l 

Active subspace methods or greedy methods, that increase 
the rank by one per iteration, have gained great attention 
in recent years nans sana. However, these meth¬ 
ods usually involve expensive subproblems, and might be 
very expensive when the true rank is high. For example, 
Laue’s method (231 needs to solve nonlinear master prob¬ 
lems using the BFGS method, which can be very expensive 
for large-scale problems. More recently, (171 proposed a 
novel active subspace selection method for solving trace- 
norm regularized problems. However, this method may 
suffer from slow convergence speed due to the approx¬ 
imated SVDs and inefficient solvers for the subproblem 
optimization. In (42), the authors proposed a Riemanian 
pursuit (RP) algorithm which increases the rank more than 
one. However, this algorithm cannot deal with trace-norm 
regularized problems. 

6 Experimental Results 

We evaluate the proposed methods on for two classical 
trace-norm based tasks, namely low-rank matrix comple¬ 
tion and LRR based clustering. All the experiments are 
conducted in Matlab (R2012b) on a PC installed a 64- 
bit operating system with an Intel(R) Core(TM) i7 CPU 
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Figure 1: Performance of various methods on TOY1. 
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(a) Relative objective difference w.r.t. time. 
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Figure 2: Performance of various methods on TOY2, where 5% of data are disturbed by severe outliers. While other 
methods over-fit on this data, the proposed robust PRG (RPRG) and SP-RPRG still achieve promising testing RMSE. 


(3.2GHz with single-thread mode) and 64GB memory. 


the source codes are not available. 


6.1 Experiments on Matrix Completion 

We study the performance of proposed methods, namely 
PRG, RPRG, SP-PRG and SP-RPRG, on the matrix com¬ 
pletion task. Three state-of-the-art trace-norm based meth¬ 
ods, e.g. APG 11431 . MMBS lHH , and Active ALT fl7l , are 
adopted as baselines. Moreover, to study the efficiency of 
proposed methods on matrix completion, we also compare 
several efficient fixed-rank methods, including RP ll42l , 
LMaFit m, ScGrassMC ED, LRGeomCG 04], qGe- 
omMC l30ln We do not report the results of some meth¬ 
ods, such as IALM j24l and method in [28], since they are 
either slower than the compared methods in this paper or 


We adopt the root-mean-square error (RMSE) testing set 
as a major evaluation metric: RMSE = |PY>(D — 
x *)||f/V(P|)> where X* denotes the recovered matrix, 
and fl denotes the index set of a testing set, and 'Po de¬ 
notes the orthogonal projection onto 11441 1 . 


6.1.1 Synthetic Experiments 

Following ll33ll42l . we generate ground-truth low-rank ma¬ 
trices D = Udiag(<x)V T £ R mxm 0 f r> where 
U £ St™, V £ St™, m = 5000, r = 50 and er is a 
50-dimensional vector with its entries sampled from a uni¬ 
form distribution [0,1000], We sample l = ujr(2m — r) 
entries from D uniformly as the observations stored in 
6 APG is available from http: / /www. math . nus . edu . sg/ ~mdt£oMfpX®hHrS thtimlin oversampling factor 1 24] , Here, 
RP is available from http://www. tanmingkui.com/rp.htmlwe set u = 2.5. We study two toy data sets whose ob- 

Active ALT is available from ser vations are perturbed by two kind of noises: In the first 

http://www.cs.utexas.edu/~cjhsieh/; , , r , . , ,, ... 

MMBS, LMaFit, ScGrassMC, qGeomMC toy data set TOY1, each entry of d is perturbed by addi- 

and LRGeomCG are available from: tive Gaussian noise of magnitude 0.011|d|| 2 /1|n|| 2 , where 

http://www.montefiore . ulg. ac.be/~mishra/fixedrank£fUSkaslfiaG&u^sianL vector with each entry being sam- 
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(a) Relative objective difference w.r.t. time. (b) Testing RMSE values w.r.t. time 


Figure 3: Performance of various methods on Movie-10M data set. 


pled from iV(0,1); The second toy data TOY2 is obtained 
based on TOY1, by further perturbing 5% of the observa¬ 
tions with outliers uniformly sampled from [—10,10]. In 
these synthetic experiments, we set v — 0.005 r\ = 0.65, 
and 5 = 0.1. 

Three trace-norm based methods APG, MMBS and Active 
ALT, are adopted as the baselines. The Relative objective 
difference and Testing RMSE w.r.t. time on TOY1 and 
TOY2 are reported in FiguresQ]and[2] respectively. 

According to Figure |l(a)| our proposed PRG, RPRG, SP- 
PRG and SP-RPRG converge much faster than the com¬ 
parators, and SP-PRG and SP-RPRG improve upon their 
counterparts (i.e., PRG and RPRG) significantly. From Fig¬ 
ure [1(b)] the testing RMSE shows similar trends to the ob¬ 
jective values. Note that, our methods thus achieve low 
RMSE values in very short times. In general, the Active 
ALT method is slower than others, which may bedue to the 
approximated SVDs and inefficient solvers for the subprob¬ 
lem optimization. 

From Figure [2(a)l on TOY2 which is disturbed by outliers, 
our proposed method in general converges faster than the 
baselines. However, from Figure |2(b)| only the proposed 
RPRG and SP-RPRG achieve promising testing RMSE val¬ 
ues. While other methods over-fit the data after several it¬ 
erations due to the outliers. Not also that SP-RPRG con¬ 
verges faster than its counterpart RPRG. This observation 
demonstrates the effectiveness and efficiency of our pro¬ 
pose methods. 

6.1.2 Experiments on Real-world Data 

We study the performance of SP-PRG and SR-PRG on 
three collaborative filtering data sets: MovieLens with 
10M ratings (denoted by Movie-IOM) (l6l . Netflix Prize 
dataset (20l and and Yahoo! Music Track 1 data set ED. 
The statistics of these data sets are recorded in Table [2] 

In the first experiment, we only compare with the three 


Table 2: Statistics of datasets. 


Data set 

m 

n 

m 

Movie-IOM 

71,567 

10,677 

10,000,054 

Netflix 

48,089 

17,770 

100,480,507 

Yahoo 

1,000,990 

624,961 

252,800,275 


trace-norm based methods, e.g. APG 03, MMBS E| 
and Active ALT fl7l on Movie-IOM. We report the change 
of Relative objective difference and Testing RMSE w.r.t. 
time in Figure [3] Here, we randomly choose 80% of the 
ratings as training set and the remainder as the testing set. 
From Figures |3(a)| and |3(b)| our proposed methods show 
much faster convergence speed as well as faster decreasing 
of testing RMSE values. 

In the second experiment, the baseline methods in¬ 
clude APG 03, MMBS El, LRGeomCG (44), qGe- 
omMC El, Lmafit (47), Active ALT 0/7], Sc- 
GrassMC l33l and RP (42]. The ranks returned by SP- 
PRG are used as the rank estimations for fixed-rank meth¬ 
ods, such as ScGrassMC, qGeomMC and Lmafit. We set 
v = 0 . 00177 = 0.65, and <5 = 0.7. Following (23ll40llT8l. 
we report the testing RMSE of different methods over 10 
random 80/20 training/testing partitions. 

Comparison results are shown in Tabled Note that we did 
not obtain the results of some methods on two larger data 
sets, i.e. Netflix and Yahoo, due to the very expensive com¬ 
putation cost. We thus leave the results blank. According 
to the table, the proposed SP-PRG and SP-RPRG meth¬ 
ods achieve better testing RMSE than RP with comparable 
time. Note that RP relies on carefully designed stopping 
conditions to induce low-rank solutions, and cannot deal 
with outliers (42] ■ In particular, SP-PRG and SP-RPRG 
achieve significant improvements in terms of testing RMSE 
on the Yahoo data set. 



































Table 3: Experimental results on real-world datasets, where 
time is recorded in seconds. 


Method 

Movie- 10M 

Netflix 

Yahoo 

RMSE 

Time 

RMSE 

Time 

RMSE 

Time 

APG 

1.094 

810.01 

1.038 

2883.80 

- 

- 

LRGeomCG 

0.823 

57.67 

0.860 

2356.86 

25.228 

18319 

QgeomMC 

0.836 

96.41 

0.897 

9794.75 

24.167 

82419 

Lmafit 

0.838 

133.86 

0.876 

2683.73 

24.368 

24349 

ALT 

0.855 

917.17 

- 

- 

- 

- 

MMBS 

0.821 

441.10 

- 

- 

- 

- 

ScGrassMC 

0.845 

216.07 

0.892 

4522.68 

24.954 

37705 

RP 

0.818 

46.56 

0.858 

1143.02 

23.451 

12456 

SP-PRG 

0.817 

53.42 

0.855 

1057.35 

22.644 

15972 

SP-RPRG 

0.815 

67.73 

0.857 

1245.15 

22.537 

17263 



Datasets 

(a) Running time (in log scale). 



Datasets 

(b) Clustering accuracy. 


Figure 4: Comparing the running times and clustering ac 
curacies of different LRR solvers on two datasets. 


SVDs w.r.t. large matrices. Moreover, our algorithm 
achieves comparable clustering performance with ll25ll . In 
contrast, the LRR solver in |[26l achieves lower clustering 
accuracy on the HARUS dataset, possibly because that al¬ 
gorithm is not guaranteed to obtain a globally optimal so¬ 
lution. 

7 Conclusion 

Classical proximal methods may require many large-rank 
SVDs when addressing the trace-norm regularized prob¬ 
lems on vector spaces. To overcome this, we first pro¬ 
pose a proximal Riemannian gradient (PRG) method to ad¬ 
dress trace-norm regularized problems over a matrix vari¬ 
ety A4< r , where r is supposed to be known. By perform¬ 
ing optimization on M< r , PRG does not require SVDs, 
thus can greatly reduce the computation cost. A robust 
version of PRG method has also been proposed to handle 
the outlier cases. To address general trace-norm regular¬ 
ized problems, a subspace pursuit strategy is proposed by 
iteratively activating a number of active subspaces. Exten¬ 
sive experiments on two classical trace-norm based tasks, 
namely low-rank matrix completion and LRR based clus¬ 
tering, demonstrate the superior efficiency of the proposed 
methods over other methods. 


6.2 Experiments on LRR Based Subspace Clustering 

To compare our proposed SP-RPRG method with the exist¬ 
ing LRR solvers in li25ll26l . we conduct experiments on the 
Extended Yale Face Database B (ExtYaleB) for face clus¬ 
tering, and the Human Activity Recognition Using Smart¬ 
phones dataset (HARUS) 0 for human activity cluster¬ 
ing. Following j26l , the clustering performance is mea¬ 
sured by clustering accuracy, namely the number of cor¬ 
rectly clustered samples over the total number of samples. 

The ExtYaleB dataset contains 2,414 frontal face images 
of 38 subjects with different lighting, poses and illumina¬ 
tion conditions, where each subject has round 64 faces. 
Following (27l . we use 640 faces from the first 10 sub¬ 
jects. Each face image is resized to 48 x 42 pixels and 
then reshaped as a 2016-dimensional gray-level intensity 
feature. The HARUS dataset is a large dataset (containing 
10,299 signals w.r.t. 6 activities) with data collected using 
embedded sensors on the smartphones carried by volun¬ 
teers on their waists, when they are conducting daily ac¬ 
tivities ( e.g ., walking, sitting, laying). The captured sensor 
signals are pre-processed to filter noise and post-processed. 
Finally, a 561 -dimensional feature vector with time and fre¬ 
quency domain variables is extracted for each signal. 

The best clustering accuracies and the corresponding run¬ 
ning times are reported in Figure |4(b)| and Figure |4(aj| re¬ 
spectively. It can observed that, our SP-RPRG method 
outperforms the two existing LRR solvers in terms of ef¬ 
ficiency, since our algorithm does not frequently involve 
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